Robust optimal quantum learning without quantum memory
A quantum learning machine for binary classification of qubit states that does not require quantum
memory is introduced and shown to perform with the minimum error rate allowed by quantum 
mechanics for any size of the training set. This result is shown to be robust under (an arbitrary 
amount of) noise and under (statistical) variations in the composition of the training set, provided 
it is large enough. This machine can be used an arbitrary number of times without retraining. 
Its required classical memory grows only logarithmically with the number of training qubits, while 
its excess risk decreases as the inverse of this number, and twice as fast as the excess risk of an 
"estimate-and-discriminate" machine, which estimates the states of the training qubits and classifies 
the data qubit with a discrimination protocol tailored to the obtained estimates. 



Quantum computers are expected to perform some 
(classical) computational tasks of practical interest, e.g., 
large integer factorization, with unprecedented efficiency. 
Quantum simulators, on the other hand, perform tasks 
of a more "quantum nature", which cannot be efficiently
carried out by a classical computer. Namely, they have 
the ability to simulate complex quantum dynamical sys- 
tems of interest. The need to perform tasks of genuine 
quantum nature is emerging as individual quantum sys- 
tems play a more prominent role in labs (and, eventually, 
in everyday life). Examples include: quantum teleportation,
dynamical control of quantum systems, or quantum
state identification. Quantum information techniques are 
already being developed in order to execute these tasks 
efficiently. 

This paper is concerned with a simple, yet fundamental 
instance of quantum state identification. A source produces
two unknown pure qubit states with equal probability.
A human expert (who knows the source specifications,
for instance) classifies a number $2n$ of states produced
by this source into two sets of size roughly $n$ (statistical
fluctuations of order $\sqrt{n}$ should be expected) and
attaches the labels 0 and 1 to them. We view these $2n$
states as a training sample, and we set ourselves to find 
a universal machine that uses this sample to assign the 
right label to a new unknown state produced by the same 
source. We refer to this task as quantum classification 
for short. 

Quantum classification can be understood as a supervised quantum learning problem, as has been noticed by Guţă and Kotłowski in their recent work [1] (though they use a slightly different setting). Learning theory, more properly named machine learning theory, is a very active and broad field which, roughly speaking, deals with algorithms capable of learning from experience [2]. Its quantum counterpart [3-7] not only provides improvements over some classical learning problems but also has a wider range of applicability, which includes the problem at hand. Quantum learning also has strong links with quantum control theory and is becoming a significant element of the quantum information processing toolbox.



An absolute limit on the minimum error in quantum classification is provided by the so-called optimal programmable discrimination machine [8-11]. In this context, to ensure optimality one assumes that a fully general two-outcome joint measurement is performed on both the $2n$ training qubits and the qubit we wish to classify, where the observed outcome determines which of the two labels, 0 or 1, is assigned to the latter qubit. Thus, in principle, this assumption implies that in a learning scenario a quantum memory is needed to store the training sample until the very moment we wish to classify the unknown qubit. The issue of whether or not the joint measurement assumption can be relaxed has not yet been addressed. Nor has the issue of how the information left after the joint measurement can be used to classify a second unknown qubit produced by the same source, unless a fresh new training set (TS) is provided (which may seem unnatural in a learning context).

The aim of this paper is to show that for a sizable TS (asymptotically large $n$) the lower bound on the probability of misclassifying the unknown qubit set by programmable discrimination can be attained by first performing a suitable measurement on the TS followed by a Stern-Gerlach type of measurement on the unknown qubit, where forward classical communication is used to control the parameters of the second measurement. The whole protocol can thus be understood as a learning machine (LM), which requires much less demanding assumptions while still having the same accuracy as the optimal programmable discrimination machine. All the relevant information about the TS needed to control the Stern-Gerlach measurement is kept in a classical memory, thus classification can be executed any time after the learning process is completed. Once trained, this machine can be subsequently used an arbitrary number of times to classify states produced by the same source. Moreover, this optimal LM is robust under noise, i.e., it still attains optimal performance if the states produced by the source undergo depolarization to any degree. Interestingly enough, in the ideal scenario where the qubit states are pure and the TS consists of exactly the same number of copies of each of the two types 0/1 (no statistical fluctuations are allowed) this LM attains the optimal programmable discrimination bound for any size $2n$ of the TS, not necessarily asymptotically large.

At this point it should be noted that LMs without quantum memory can be naturally assembled from two quantum information primitives: state estimation and state discrimination. We will refer to these specific constructions as "estimate-and-discriminate" (E&D) machines. The protocol they execute is as follows: by performing, e.g., an optimal covariant measurement on the $n$ qubits in the TS labeled 0, their state $|\psi_0\rangle$ is estimated with some accuracy, and likewise the state $|\psi_1\rangle$ of the other $n$ qubits that carry the label 1 is characterized. This classical information is stored and subsequently used to discriminate an unknown qubit state. It will be shown that the excess risk (i.e., excess average error over classification when the states $|\psi_0\rangle$ and $|\psi_1\rangle$ are perfectly known) of this protocol is twice that of the optimal LM. The fact that the E&D machine is suboptimal means that the kind of information retrieved from the TS and stored in the classical memory of the optimal LM is specific to the classification problem at hand, and that the machine itself is more than the mere assemblage of well known protocols.

We will first present our results for the ideal scenario 
where states are pure and no statistical fluctuation in the 
number of copies of each type of state is allowed. The 
effect of these fluctuations and the robustness of the LM 
optimality against noise will be postponed to the end of 
the section. 



RESULTS 



Programmable machines. Before presenting our results, let us summarize what is known about optimal machines for programmable discrimination. This will also allow us to introduce our notation and conventions. Neglecting statistical fluctuations, the TS of size $2n$ is given by a state pattern of the form $[\psi_0^{\otimes n}]\otimes[\psi_1^{\otimes n}]$, where the shorthand notation $[\,\cdot\,]\equiv|\cdot\rangle\langle\cdot|$ will be used throughout the paper, and where no knowledge about the actual states $|\psi_0\rangle$ and $|\psi_1\rangle$ is assumed (the figure of merit will be an average over all states of this form). The qubit state that we wish to label (the data qubit) belongs either to the first group (it is $[\psi_0]$) or to the second one (it is $[\psi_1]$). Thus, the optimal machine must discriminate between the two possible states: either $\rho_0^n=[\psi_0^{\otimes(n+1)}]_{AB}\otimes[\psi_1^{\otimes n}]_C$, in which case it should output the label 0, or $\rho_1^n=[\psi_0^{\otimes n}]_A\otimes[\psi_1^{\otimes(n+1)}]_{BC}$, in which case the machine should output the label 1. Here and when needed for clarity, we name the three subsystems involved in this problem A, B and C, where AC is the TS and B is the data qubit. In order to discriminate $\rho_0^n$ from $\rho_1^n$, a joint two-outcome measurement, independent of the actual states $|\psi_0\rangle$ and $|\psi_1\rangle$, is performed on all $2n+1$ qubits. Mathematically, it is represented by a positive operator valued measure (POVM) $\mathcal E=\{E_0,E_1=\mathbb 1-E_0\}$. The minimum average error probability of the quantum classification process is given by $P_e=(1-\Delta/2)/2$, where $\Delta=2\max_{\mathcal E}\int d\psi_0\,d\psi_1\,\mathrm{tr}[(\rho_0^n-\rho_1^n)E_0]$. This average can be cast as a SU(2) group integral and, in turn, readily computed using Schur's lemma to give

$$\Delta=2\max_{E_0}\mathrm{tr}[(\sigma_0^n-\sigma_1^n)E_0]=\|\sigma_0^n-\sigma_1^n\|_1, \tag{1}$$

where $\|\cdot\|_1$ is the trace norm and $\sigma_{0/1}^n$ are average states defined as

$$\sigma_0^n=\frac{\mathbb 1_{n+1}\otimes\mathbb 1_n}{d_{n+1}d_n}=\frac{\mathbb 1_{AB}\otimes\mathbb 1_C}{d_{AB}d_C},\qquad \sigma_1^n=\frac{\mathbb 1_n\otimes\mathbb 1_{n+1}}{d_nd_{n+1}}=\frac{\mathbb 1_A\otimes\mathbb 1_{BC}}{d_Ad_{BC}}. \tag{2}$$

In this paper $\mathbb 1_m$ stands for the projector on the fully symmetric invariant subspace of $m$ qubits, which has dimension $d_m=m+1$. Sometimes, it turns out to be more convenient to use the subsystem labels, as on the right of (2). The maximum in (1) is attained by choosing $E_0$ to be the projector onto the positive part of $\sigma_0^n-\sigma_1^n$.

The right-hand side of (1) can be computed by switching to the total angular momentum basis, $\{|J,M\rangle\}$, where $\frac12\le J\le n+\frac12$ and $-J\le M\le J$ (an additional label may be required to specify the way subsystems couple to give $J$; see below). In this (Jordan) basis [10] the problem simplifies significantly, as it reduces to pure state discrimination [12] on each subspace corresponding to a possible value of the total angular momentum $J$ and magnetic number $M$. By writing the various values of the total angular momentum as $J=k+\frac12$, the final answer takes the form [11]:

$$P_e^{\rm opt}=\frac12-\frac{1}{d_n^2\,d_{n+1}}\sum_{k=0}^{n}k\sqrt{d_n^2-k^2}. \tag{3}$$

A simple asymptotic expression for large $n$ can be computed using Euler-Maclaurin's summation formula. After some algebra one obtains

$$P_e^{\rm opt}\simeq\frac16+\frac{1}{3n}+\dots \tag{4}$$



The leading order (1/6) coincides with the average error probability $\int d\psi_0\,d\psi_1\,p_e^{\rm opt}(\psi_0,\psi_1)$, where $p_e^{\rm opt}(\psi_0,\psi_1)$ is the minimum error in discrimination between the two known states $|\psi_0\rangle$ and $|\psi_1\rangle$.
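As a numerical cross-check of the statements above, the trace norm in Eq. (1) can be evaluated by brute force for small $n$. The following is a minimal sketch, not the authors' code: it builds the symmetric projectors by averaging permutation operators, orders the qubits as A, B, C, and compares the result with a closed-form sum. The function names are hypothetical, and the closed form used for comparison is an assumption of this sketch (it reproduces the $n=1$ value $\Delta=1/\sqrt3$ quoted later in the text and the $1/6+1/(3n)$ asymptotics). The last helper checks the leading order 1/6 using the standard fact that the squared overlap of two Haar-random pure qubit states is uniformly distributed on [0, 1].

```python
import itertools
import math

import numpy as np


def sym_projector(m):
    """Projector onto the fully symmetric subspace of m qubits (rank m + 1)."""
    dim = 2 ** m
    eye = np.eye(dim).reshape((2,) * (2 * m))
    total = np.zeros_like(eye)
    for perm in itertools.permutations(range(m)):
        # Permuting the m "row" indices of the identity gives a permutation operator.
        total += np.transpose(eye, list(perm) + list(range(m, 2 * m)))
    return total.reshape(dim, dim) / math.factorial(m)


def p_opt_bruteforce(n):
    """P_e = 1/2 - ||sigma_0 - sigma_1||_1 / 4 with qubits ordered A, B, C."""
    d_n, d_n1 = n + 1, n + 2
    sigma0 = np.kron(sym_projector(n + 1), sym_projector(n)) / (d_n1 * d_n)
    sigma1 = np.kron(sym_projector(n), sym_projector(n + 1)) / (d_n * d_n1)
    trace_norm = np.abs(np.linalg.eigvalsh(sigma0 - sigma1)).sum()
    return 0.5 - trace_norm / 4


def p_opt_closed_form(n):
    """Closed-form sum assumed in this sketch for the optimal error probability."""
    d_n, d_n1 = n + 1, n + 2
    s = sum(k * math.sqrt(d_n ** 2 - k ** 2) for k in range(n + 1))
    return 0.5 - s / (d_n ** 2 * d_n1)


def known_state_average(samples=200_000):
    """Average Helstrom error for two known random pure qubit states.

    Midpoint rule over t = |<psi0|psi1>|^2, taken uniform on [0, 1].
    """
    t = (np.arange(samples) + 0.5) / samples
    return float(np.mean(0.5 * (1 - np.sqrt(1 - t))))


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, p_opt_bruteforce(n), p_opt_closed_form(n), 1 / 6 + 1 / (3 * n))
    print("known-state average:", known_state_average())
```

The brute-force route scales exponentially in $n$ (the matrices have dimension $2^{2n+1}$), so it is only meant as a sanity check for very small training sets.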

Learning machines. The formulas above give an absolute lower bound to the error probability that can be physically attained. We wish to show that this bound can actually be attained by a learning machine that uses a classical register to store all the relevant information obtained in the learning process, regardless of the size, $2n$, of the TS. A first hint that this may be possible is that the optimal measurement $\mathcal E$ can be shown to have positive partial transposition with respect to the partition TS/data qubit. Indeed this is a necessary condition for any measurement that consists of a local POVM on the TS whose outcome is fed forward to a second POVM on the data qubit. This class of one-way adaptive measurements can be characterized as:

$$E_0=\sum_\mu L_\mu\otimes D_\mu,\qquad E_1=\sum_\mu L_\mu\otimes(\mathbb 1_1-D_\mu), \tag{5}$$

where the positive operators $L_\mu$ ($D_\mu$) act on the Hilbert space of the TS (data qubit we wish to classify), and $\sum_\mu L_\mu=\mathbb 1_n\otimes\mathbb 1_n$. The POVM $\mathcal L=\{L_\mu\}$ represents the learning process, and the parameter $\mu$, which a priori may be discrete or continuous, encodes the information gathered in the measurement and required at the classification stage. For each possible value of $\mu$, $\mathcal D_\mu=\{D_\mu,\mathbb 1_1-D_\mu\}$ defines the measurement on the data qubit, whose two outcomes represent the classification decision. Clearly, the size of the required classical memory will be determined by the information content of the random variable $\mu$.

Covariance and structure of $\mathcal L$. We will next prove that the POVM $\mathcal L$, which extracts the relevant information from the TS, can be chosen to be covariant. This will also shed some light on the physical interpretation of the classical variable $\mu$. The states (2) are by definition invariant under a rigid rotation acting on subsystems AC and B, of the form $U=U_{AC}\otimes u$, where throughout this paper, $U$ stands for an element of the appropriate representation of SU(2), which should be obvious by context (in this case $U_{AC}=u^{\otimes 2n}$, where $u$ is in the fundamental representation). Since $\mathrm{tr}(E_0\sigma_{0/1}^n)=\mathrm{tr}(E_0U^\dagger\sigma_{0/1}^nU)=\mathrm{tr}(UE_0U^\dagger\sigma_{0/1}^n)$, the positive operator $UE_0U^\dagger$ gives the same error probability as $E_0$ for any choice of $U$ [as can be seen from, e.g., Eq. (1)]. The same property thus holds for their average over the whole SU(2) group, $\bar E_0=\int du\,UE_0U^\dagger$, which is invariant under rotations, and where $du$ denotes the SU(2) Haar measure. By further exploiting rotation invariance (see Sec. Methods for full details) $\bar E_0$ can be written as

$$\bar E_0=\int du\,\left(U_{AC}\,\Omega\,U_{AC}^\dagger\right)\otimes\left(u[\uparrow]u^\dagger\right) \tag{6}$$

for some positive operator $\Omega$, where we use the shorthand notation $[\uparrow]\equiv|\tfrac12,\tfrac12\rangle\langle\tfrac12,\tfrac12|$. Similarly, the second POVM element can be chosen to be an average, $\bar E_1$, of the form (6), with $[\downarrow]\equiv|\tfrac12,-\tfrac12\rangle\langle\tfrac12,-\tfrac12|$ instead of $[\uparrow]$. We immediately recognize $\mathcal E=\{\bar E_0,\bar E_1\}$ to be of the form (5), where $u$, $L_u\equiv U_{AC}\,\Omega\,U_{AC}^\dagger$ and $D_u\equiv u[\uparrow]u^\dagger$ play the role of $\mu$, $L_\mu$ and $D_\mu$, respectively. Hence, w.l.o.g. we can choose $\mathcal L=\{U_{AC}\,\Omega\,U_{AC}^\dagger\}_{SU(2)}$, which is a covariant POVM with seed $\Omega$. Note that $u$ entirely defines the Stern-Gerlach measurement, $\mathcal D_u=\{u[\uparrow]u^\dagger,u[\downarrow]u^\dagger\}$, i.e., $u$ specifies the direction along which the Stern-Gerlach has to be oriented. This is the relevant information that has to be retrieved from the TS and kept in the classical memory of the LM.
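The classification stage described above only needs the stored direction. The following is a minimal sketch under stated assumptions: the direction is parametrized by polar angles `theta` and `phi` (a hypothetical encoding of the classical memory, not one prescribed by the text), and the function names are illustrative. It returns the probabilities of the two Stern-Gerlach outcomes, which are read as the labels 0 and 1.

```python
import numpy as np


def stern_gerlach_projector(theta, phi):
    """Projector u[up]u^dagger onto spin up along the direction (theta, phi)."""
    up = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
    return np.outer(up, up.conj())


def label_probabilities(rho, theta, phi):
    """Probabilities of assigning label 0 and label 1 to the data qubit rho."""
    d0 = stern_gerlach_projector(theta, phi)   # outcome "up" -> label 0
    p0 = float(np.real(np.trace(d0 @ rho)))
    return p0, 1.0 - p0                        # outcome "down" -> label 1


if __name__ == "__main__":
    data_qubit = stern_gerlach_projector(0.7, 1.3)   # pure state along the stored direction
    print(label_probabilities(data_qubit, 0.7, 1.3))
```

Because only two angles are stored, the same classical record can be reused for any number of data qubits, in line with the reusability claim made for the LM.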



Covariance also has implications for the structure of $\Omega$. In Sec. Methods, we show that this seed can always be written as

$$\Omega=\sum_{m=-n}^{n}\Omega_m, \tag{7}$$

where

$$\sum_{m=-j}^{j}\langle j,m|\Omega_m|j,m\rangle=2j+1,\qquad 0\le j\le n, \tag{8}$$

and $j$ ($m$) stands for the total angular momentum $j_{AC}$ (magnetic number $m_{AC}$) of the qubits in the TS. In other words, the seed is a direct sum of operators with a well defined magnetic number. As a result, we can interpret that $\Omega$ points along the $z$-axis. The constraint (8) ensures that $\mathcal L$ is a resolution of the identity.

To gain more insight into the structure of $\Omega$, we trace subsystems B in the definition of $\Delta$, given by the first equality in Eq. (1). For the covariant POVM (6), rotational invariance enables us to express this quantity as

$$\Delta^{\rm LM}=2\max_\Omega\mathrm{tr}\{(\sigma_0^n-\sigma_1^n)\,\Omega\otimes[\uparrow]\}=2\max_\Omega\mathrm{tr}(\Gamma_\uparrow\Omega), \tag{9}$$

where we have defined

$$\Gamma_\uparrow=\mathrm{tr}_B\{[\uparrow](\sigma_0^n-\sigma_1^n)\} \tag{10}$$

(the two resulting terms in the right-hand side are the post-measurement states of AC conditioned to the outcome $\uparrow$ after the Stern-Gerlach measurement $\mathcal D_z$ is performed on B) and the maximization is over valid seeds (i.e., over positive operators $\Omega$ such that $\int du\,U_{AC}\,\Omega\,U_{AC}^\dagger=\mathbb 1_{AC}$). We calculate $\Gamma_\uparrow$ in Sec. Methods. The resulting expression can be cast in the simple and transparent form

$$\Gamma_\uparrow=\frac{\hat J_z^A-\hat J_z^C}{d_n^2\,d_{n+1}}, \tag{11}$$

where $\hat J_z^{A/C}$ is the $z$ component of the total angular momentum operator acting on subsystem A/C, i.e., on the training qubits to which the human expert assigned the label 0/1. Eq. (11) suggests that the optimal $\Omega$ should project on the subspace of A (C) with maximum (minimum) magnetic number, which implies that $m_{AC}=0$. An obvious candidate is

$$\Omega=[\phi^0],\qquad|\phi^0\rangle=\sum_{j=0}^{n}\sqrt{2j+1}\,|j,0\rangle. \tag{12}$$

Below we prove that indeed this seed generates the optimal LM POVM.

Optimality of the LM. We now prove our main result: the POVM $\mathcal E=\{\bar E_0,\bar E_1\}$, generated from the seed state in Eq. (12), gives an error probability $P_e^{\rm LM}=(1-\Delta^{\rm LM}/2)/2$ equal to the minimum error probability $P_e^{\rm opt}$ of the optimal programmable discriminator, Eq. (3). It is, therefore, optimal and, moreover, it attains the absolute minimum allowed by quantum physics.

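The proof goes as follows. From the very definition of error probability,

$$P_e^{\rm LM}=\frac12\left\{\mathrm{tr}\left(\sigma_0^n\bar E_1\right)+\mathrm{tr}\left(\sigma_1^n\bar E_0\right)\right\} \tag{13}$$

$$\phantom{P_e^{\rm LM}}=\frac12\left\{\mathrm{tr}\left(\sigma_0^n\,[\phi^0]\otimes[\downarrow]\right)+\mathrm{tr}\left(\sigma_1^n\,[\phi^0]\otimes[\uparrow]\right)\right\}, \tag{14}$$

where we have used rotational invariance. We can further simplify this expression by writing it as

$$P_e^{\rm LM}=\frac{1}{2d_nd_{n+1}}\left(\left\|\mathbb 1_{AB}\otimes\mathbb 1_C\,|\phi^0\rangle|\downarrow\rangle\right\|^2+\left\|\mathbb 1_A\otimes\mathbb 1_{BC}\,|\phi^0\rangle|\uparrow\rangle\right\|^2\right). \tag{15}$$

To compute the projections inside the norm signs we first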
write $|\phi^0\rangle|\uparrow\rangle$ ($|\phi^0\rangle|\downarrow\rangle$ will be considered below) in the total angular momentum basis $|J,M\rangle_{(AC)B}$, where the attached subscripts remind us how subsystems A, B and C are both ordered and coupled to give the total angular momentum $J$ (note that a permutation of subsystems, prior to fixing the coupling, can only give rise to a global phase, thus not affecting the value of the norm we wish to compute). This is a trivial task since $|\phi^0\rangle|\uparrow\rangle\equiv|\phi^0\rangle_{AC}|\uparrow\rangle_B$, i.e., subsystems are ordered and coupled as the subscript (AC)B specifies, so we just need the Clebsch-Gordan coefficients

$$\langle j\pm\tfrac12,\tfrac12|j,0;\tfrac12,\tfrac12\rangle=\pm\sqrt{\frac{j+\frac12\pm\frac12}{2j+1}}. \tag{16}$$

The projector $\mathbb 1_A\otimes\mathbb 1_{BC}$, however, is naturally written as $\mathbb 1_A\otimes\mathbb 1_{BC}=\sum_{J,M}|J,M\rangle_{A(CB)}\langle J,M|$. This basis differs from that above in the coupling of the subsystems. To compute the projection $\mathbb 1_A\otimes\mathbb 1_{BC}|\phi^0\rangle|\uparrow\rangle$ we only need to know the overlaps between the two bases, ${}_{A(CB)}\langle J,M|J,M\rangle_{(AC)B}$. Wigner's 6j-symbols provide this information as a function of the angular momenta of the various subsystems (the overlaps are computed explicitly in Sec. Methods).

Using the Clebsch-Gordan coefficients and the overlaps between the two bases, it is not difficult to obtain the expansion of $\mathbb 1_A\otimes\mathbb 1_{BC}|\phi^0\rangle|\uparrow\rangle$ in the basis $|J,\tfrac12\rangle_{A(CB)}$ [Eq. (17)].

An identical expression can be obtained for $\mathbb 1_{AB}\otimes\mathbb 1_C|\phi^0\rangle|\downarrow\rangle$ in the basis $|J,M\rangle_{(BA)C}$. To finish the proof, we compute the norm squared of (17) and substitute in (15). It is easy to check that this gives the expression of the error probability in (3), i.e., $P_e^{\rm LM}=P_e^{\rm opt}$.
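For $n=1$ the optimality claim can be checked directly without any recoupling machinery. The sketch below is an illustration under stated assumptions, not the proof in the text: it takes the seed $|\phi^0\rangle=|0,0\rangle+\sqrt3\,|1,0\rangle$ on AC with the standard Clebsch-Gordan sign convention $|0,0\rangle=(|\uparrow\downarrow\rangle-|\downarrow\uparrow\rangle)/\sqrt2$, orders the qubits as A, B, C, and evaluates $\Delta^{\rm LM}=2\,\mathrm{tr}\{(\sigma_0^1-\sigma_1^1)\,[\phi^0]\otimes[\uparrow]\}$. All variable names are hypothetical.

```python
import functools

import numpy as np

UP, DOWN = np.array([1.0, 0.0]), np.array([0.0, 1.0])
I2 = np.eye(2)
SWAP = np.eye(4)[[0, 2, 1, 3]]
SYM2 = (np.eye(4) + SWAP) / 2        # projector onto the symmetric subspace of two qubits


def kron(*ops):
    """Kronecker product of several vectors or matrices."""
    return functools.reduce(np.kron, ops)


# Average states for n = 1 (d_1 = 2, d_2 = 3), qubits ordered A, B, C.
sigma0 = kron(SYM2, I2) / 6          # 1_AB / d_2 (x) 1_C / d_1
sigma1 = kron(I2, SYM2) / 6          # 1_A / d_1 (x) 1_BC / d_2

# Seed amplitudes of |phi0> = |0,0> + sqrt(3)|1,0> in the product basis of AC.
a = (np.sqrt(3) + 1) / np.sqrt(2)    # amplitude of |up>_A |down>_C
b = (np.sqrt(3) - 1) / np.sqrt(2)    # amplitude of |down>_A |up>_C

# |phi0>_AC |up>_B written in the A, B, C ordering.
state = a * kron(UP, UP, DOWN) + b * kron(DOWN, UP, UP)

delta_lm = 2 * state @ (sigma0 - sigma1) @ state
p_lm = (1 - delta_lm / 2) / 2

if __name__ == "__main__":
    print(delta_lm, 1 / np.sqrt(3))
    print(p_lm, (6 - np.sqrt(3)) / 12)
```

The squared norm of the seed, $a^2+b^2=4$, equals the dimension of AC, as required for the covariant POVM to resolve the identity.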



Memory of the LM. Let us go back to the POVM condition, specifically to the minimum number of unitary transformations needed to ensure that, given a suitable discretization $\int du\to\sum_\mu p_\mu$ of (6), $\{p_\mu U_\mu\,\Omega\,U_\mu^\dagger\}$ is a resolution of the identity for arbitrary $n$. This issue is addressed in [13], where an explicit algorithm for constructing finite POVMs, including the ones we need here, is given. From the results there, we can bound the minimum number of outcomes of $\mathcal L$ by $2(n+1)(2n+1)$. This figure is important because its binary logarithm gives an upper bound to the minimum memory required. We see that it grows at most logarithmically with the size of the TS.
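To make the logarithmic growth concrete, the bound $2(n+1)(2n+1)$ on the number of outcomes translates into a number of classical bits as in the short sketch below (the function name is illustrative).

```python
def memory_bits(n):
    """Bits sufficient to index the 2(n+1)(2n+1) outcomes bounding the POVM size."""
    outcomes = 2 * (n + 1) * (2 * n + 1)
    return (outcomes - 1).bit_length()   # integer ceil(log2(outcomes))


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, memory_bits(n))
```

Multiplying the size of the training set by 100 adds only about 13 bits to the register.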

E&D machines. E&D machines can be discussed within this very framework, as they are particular instances of LMs. In this case the POVM $\mathcal L$ has the form $L_{\alpha i}=M_\alpha\otimes M'_i$, where $\mathcal M=\{M_\alpha\}$ and $\mathcal M'=\{M'_i\}$ are themselves POVMs on the TS subsystems A and C respectively. The role of $\mathcal M$ and $\mathcal M'$ is to estimate (optimally) the qubit states in these subsystems [14]. The measurement on B (the data qubit) now depends on the pair of outcomes of $\mathcal M$ and $\mathcal M'$: $\mathcal D_{\alpha i}=\{D_{\alpha i},\mathbb 1_1-D_{\alpha i}\}$. It performs standard one-qubit discrimination according to the two pure-state specifications, say, the unit Bloch vectors $s_0^\alpha$ and $s_1^i$, estimated with $\mathcal M$ and $\mathcal M'$. In this section, we wish to show that E&D machines perform worse than the optimal LM.

We start by tracing subsystems AC in Eq. (1), which for E&D reads

$$\Delta^{\rm E\&D}=2\max_{\mathcal M,\mathcal M'}\mathrm{tr}_B\max_{\{D_{\alpha i}\}}\mathrm{tr}_{AC}[(\sigma_0^n-\sigma_1^n)E_0]. \tag{18}$$

If we write $\Delta^{\rm E\&D}=\max_{\mathcal M,\mathcal M'}\Delta_{\mathcal M,\mathcal M'}$, we have

$$\Delta_{\mathcal M,\mathcal M'}=\sum_{\alpha i}p_\alpha p'_i\,|r_0^\alpha-r_1^i|, \tag{19}$$

where $r_0^\alpha$ and $r_1^i$ are the Bloch vectors of the data qubit states

$$\rho_0^\alpha=\frac{1}{p_\alpha}\mathrm{tr}_A\left(\frac{\mathbb 1_{n+1}}{d_{n+1}}M_\alpha\right),\qquad\rho_1^i=\frac{1}{p'_i}\mathrm{tr}_C\left(\frac{\mathbb 1_{n+1}}{d_{n+1}}M'_i\right), \tag{20}$$

conditioned to the outcomes $\alpha$ and $i$ respectively, and $p_\alpha=d_n^{-1}\mathrm{tr}\,M_\alpha$, $p'_i=d_n^{-1}\mathrm{tr}\,M'_i$ are their probabilities. We now recall that optimal estimation necessarily requires that all elements of $\mathcal M$ must be of the form $M_\alpha=c_\alpha U_\alpha[\psi^0]U_\alpha^\dagger$, where $|\psi^0\rangle=|\tfrac n2,\tfrac n2\rangle$, $c_\alpha>0$, and $\{U_\alpha\}$ are appropriate SU(2) rotations (analogous necessary conditions are required for $\mathcal M'$) [15]. Substituting in Eq. (20) we obtain $p_\alpha=c_\alpha/d_n$, and

$$\rho_0^\alpha=\frac{1}{d_{n+1}}\left(\mathbb 1_1+n\,u_\alpha[\uparrow]u_\alpha^\dagger\right) \tag{21}$$

(a similar expression holds for $\rho_1^i$). This means that the Bloch vector of the data qubit conditioned to outcome $\alpha$ is proportional to $s_0^\alpha$ (the Bloch vector of the corresponding estimate) and is shrunk by a factor $n/d_{n+1}=n/(n+2)\equiv\eta$. Note in passing that the shrinking factor $\eta$ is independent of the measurements, provided they are optimal.

Surprisingly at first sight, POVMs that are optimal, and thus equivalent, for estimation may lead to different minimum error probabilities. In particular, the continuous covariant POVM is outperformed in the problem at hand by those with a finite number of outcomes. Optimal POVMs with few outcomes enforce large angles between the estimates $s_0^\alpha$ and $s_1^i$, and thus between $r_0^\alpha$ and $r_1^i$ ($\pi/2$ in the $n=1$ example below). This translates into increased discrimination efficiency, as shown by (19), without compromising the quality of the estimation itself. Hence the orientation of $\mathcal M$ relative to $\mathcal M'$ (which for two continuous POVMs does not even make sense) plays an important role, as does the actual number of outcomes. With an increasing size of the TS, the optimal estimation POVMs also require a larger number of outcomes and the angle between the estimates decreases on average, since they tend to fill the 2-sphere isotropically. Hence, the minimum error probability is expected to approach that of two continuous POVMs. This is supported by numerical calculations. The problem of finding the optimal E&D machine for arbitrary $n$ appears to be a hard one and is currently under investigation. Here we will give the absolute optimal E&D machine for $n=1$ and, also, we will compute the minimum error probability for both $\mathcal M$ and $\mathcal M'$ being the continuous POVM that is optimal for estimation. The latter, as mentioned, is expected to attain the optimal E&D error probability asymptotically.

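We can obtain an upper bound on (19) by applying the Schwarz inequality. We readily find that

$$\Delta_{\mathcal M,\mathcal M'}\le\sqrt{\sum_{\alpha i}p_\alpha p'_i\,|r_0^\alpha-r_1^i|^2}=\sqrt{\sum_\alpha p_\alpha|r_0^\alpha|^2+\sum_ip'_i|r_1^i|^2}, \tag{22}$$

where we have used that $\sum_\alpha p_\alpha r_0^\alpha=\sum_ip'_ir_1^i=0$, as follows from the POVM condition on $\mathcal M$ and $\mathcal M'$. The maximum norm of $r_0^\alpha$ and $r_1^i$ is bounded by 1/3 [the shrinking factor $\eta$ for $n=1$]. Thus

$$\Delta_{\mathcal M,\mathcal M'}\le\sqrt2/3<1/\sqrt3=\Delta^{\rm LM}, \tag{23}$$

where the value of $\Delta^{\rm LM}$ can be read off from Eq. (3). The E&D bound $\sqrt2/3$ is attained by the choices $M_{\uparrow/\downarrow}=[\uparrow/\downarrow]$ and $M'_{+/-}=[+/-]$, where we have used the definition $|\pm\rangle=(|\uparrow\rangle\pm|\downarrow\rangle)/\sqrt2$.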
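The $n=1$ bound can be reproduced by evaluating the sum $\sum_{\alpha i}p_\alpha p'_i|r_0^\alpha-r_1^i|$ for explicit sets of estimates. The sketch below is illustrative only: the helper `delta_end` and its argument names are hypothetical, the estimates are given as unit Bloch vectors, and the conditional Bloch vectors are obtained by shrinking them with $\eta=n/(n+2)=1/3$.

```python
import itertools
import math

import numpy as np

ETA = 1 / 3   # shrinking factor n/(n+2) for n = 1


def delta_end(estimates_a, probs_a, estimates_c, probs_c, eta=ETA):
    """Sum over outcome pairs of p * p' * |r0 - r1| with r = eta * (unit estimate)."""
    total = 0.0
    pairs = itertools.product(zip(estimates_a, probs_a), zip(estimates_c, probs_c))
    for (s0, p), (s1, q) in pairs:
        total += p * q * eta * np.linalg.norm(np.subtract(s0, s1))
    return total


z_axis = [(0, 0, 1), (0, 0, -1)]   # outcomes of a measurement along z on A
x_axis = [(1, 0, 0), (-1, 0, 0)]   # outcomes of a measurement along x on C
half = [0.5, 0.5]

delta_orthogonal = delta_end(z_axis, half, x_axis, half)
delta_aligned = delta_end(z_axis, half, z_axis, half)

if __name__ == "__main__":
    print(delta_orthogonal, math.sqrt(2) / 3)   # orthogonal axes saturate the bound
    print(delta_aligned)                        # aligned axes do worse
    print(1 / math.sqrt(3))                     # value attained by the optimal LM
```

Choosing the two measurement axes at right angles is what maximizes the average distance between the conditional Bloch vectors; aligned axes give only 1/3.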

For arbitrary $n$, a simple expression for the error probability can be derived in the continuous POVM case, $\mathcal M=\mathcal M'=\{d_nU_s[\psi^0]U_s^\dagger\}_{s\in\mathbb S^2}$, where $s$ is a unit vector (a point on the 2-sphere $\mathbb S^2$) and $U_s$ is the representation of the rotation that takes the unit vector along the $z$-axis, $z$, into $s$. Here $s$ labels the outcomes of the measurement and thus plays the role of $\alpha$ and $i$. The continuous version of (19) can be easily computed to be

$$\Delta^{\rm E\&D}=\eta\int ds\,ds'\,|s-s'|=\frac43\,\frac{n}{n+2},\qquad P_e^{\rm E\&D}=\frac12-\frac{n}{3(n+2)}. \tag{24}$$

Asymptotically, we have $P_e^{\rm E\&D}=1/6+2/(3n)+\dots$. Therefore, the excess risk, which we recall is the difference between the average error probability of the machine under consideration and that of the optimal discrimination protocol for known qubit states (1/6), is $R^{\rm E\&D}=2/(3n)+\dots$. This is twice the excess risk of the optimal programmable machine and the optimal LM, which can be read off from Eq. (4):

$$R^{\rm LM}=R^{\rm opt}=\frac{1}{3n}+\dots \tag{25}$$

For $n=1$, Eq. (23) leads to $R^{\rm E\&D}=(4-\sqrt2)/12$. This value is already about 14% larger than the excess risk of the optimal LM, $R^{\rm LM}=(4-\sqrt3)/12$.
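The factor of two between the excess risks can be checked numerically. In the sketch below, which is an illustration and not the authors' calculation, the E&D excess risk for continuous covariant estimation is taken as $1/2-n/[3(n+2)]-1/6$, obtained from the shrinking factor $\eta=n/(n+2)$ and the mean distance 4/3 between two uniformly random points on the unit sphere. The closed-form sum used for the optimal machine is an assumption of this sketch, chosen because it reproduces the $n=1$ value $(4-\sqrt3)/12$ and the $1/(3n)$ asymptotics quoted in the text. Function names are hypothetical.

```python
import math


def excess_risk_end_continuous(n):
    """Excess risk of E&D with continuous covariant estimation on A and C."""
    return 0.5 - n / (3 * (n + 2)) - 1 / 6


def excess_risk_opt(n):
    """Excess risk of the optimal machine from the closed-form sum assumed here."""
    d_n, d_n1 = n + 1, n + 2
    s = math.fsum(k * math.sqrt(d_n ** 2 - k ** 2) for k in range(n + 1))
    return 0.5 - s / (d_n ** 2 * d_n1) - 1 / 6


if __name__ == "__main__":
    for n in (1, 10, 100, 1000, 100_000):
        ratio = excess_risk_end_continuous(n) / excess_risk_opt(n)
        print(n, excess_risk_end_continuous(n), excess_risk_opt(n), ratio)
```

The ratio approaches 2 slowly from below, because the optimal excess risk carries a subleading correction that decays more slowly than $1/n^2$.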

Robustness of LMs. So far we have adhered to 
the simplifying assumptions that the two types of states 
produced by the source are pure and, moreover, exactly 
equal in number. Neither of these two assumptions is 
likely to hold in practice, as both interaction with the
environment (i.e., decoherence and noise) and statistical
fluctuations in the numbers of states of each type will
certainly take place. Here we prove that the performance
of the optimal LM is not altered by these effects in the 
asymptotic limit of large TS. More precisely, the excess 
risk of the optimal LM remains equal to that of the opti- 
mal programmable discriminator to leading order in 1/n 
when noise and statistical fluctuations are taken into ac- 
count. 

Let us first consider the impact of noise, which we will assume isotropic and uncorrelated. Hence, instead of producing $[\psi_{0/1}]$, the source produces copies of

$$\rho_{0/1}=r[\psi_{0/1}]+(1-r)\frac{\mathbb 1}{2},\qquad 0<r\le1. \tag{26}$$

In contrast to the pure qubits case, where $[\psi_{0/1}^{\otimes n}]$ belongs to the fully symmetric invariant subspace of maximum angular momentum $j=n/2$, the state of A/C is now a full-rank matrix of the form $\rho_{0/1}^{\otimes n}$. Hence, it has projections on all the orthogonal subspaces $S_j\otimes\mathbb C^{\nu_j^n}$, where $S_j=\mathrm{span}(\{|j,m\rangle\}_{m=-j}^{j})$ and $\mathbb C^{\nu_j^n}$ is the multiplicity space of the representation with total angular momentum $j$ (see Sec. Methods for a formula of the multiplicity $\nu_j^n$), and $j$ is in the range from 0 (1/2) to $n/2$ if $n$ is even (odd). Therefore $\rho_{0/1}^{\otimes n}$ is block-diagonal in the total angular momentum eigenbasis. The multiplicity space $\mathbb C^{\nu_j^n}$ carries the label of the $\nu_j^n$ different equivalent representations of given $j$, which arise from the various ways the individual qubits can couple to produce total angular momentum $j$. For permutation invariant states (such as $\rho_{0/1}^{\otimes n}$), this has no physical relevance and the only effect of $\mathbb C^{\nu_j^n}$ in calculations is through its dimension $\nu_j^n$. Hence, the multiplicity space will be dropped throughout this paper.

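The average states now become a direct sum of the form

$$\int d\psi_0\,d\psi_1\,\rho_0^{\otimes(n+1)}\otimes\rho_1^{\otimes n}=\sum_\xi p_\xi^n\,\sigma_{0,\xi}^n, \tag{27}$$

$$\int d\psi_0\,d\psi_1\,\rho_0^{\otimes n}\otimes\rho_1^{\otimes(n+1)}=\sum_\xi p_\xi^n\,\sigma_{1,\xi}^n, \tag{28}$$

where we use the shorthand notation $\xi=\{j_A,j_C\}$ [each angular momentum ranges from 0 (1/2) to $n/2$ for $n$ even (odd)], and $p_\xi^n=p_{j_A}^np_{j_C}^n$ is the probability of any of the two average states projecting on the block labeled $\xi$.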
Hence, $\Delta^{\rm LM}$ becomes a sum of contributions, one from each block $\xi$ [Eq. (29)].



The number of terms in Eq. (29) is $[(2n+3\pm1)/4]^2$ for even/odd $n$. It grows quadratically with $n$, in contrast to the pure state case for which there is a single contribution corresponding to $j_A=j_C=n/2$. In the asymptotic limit of large $n$, however, a big simplification arises because of the following results: for each $\xi$ of the form $\xi=\{j,j\}$ ($j_A=j_C=j$), the following relation holds (see Sec. Methods)

$$\sigma_{0,\xi}^n-\sigma_{1,\xi}^n=\frac{r\langle\hat J_z\rangle_j}{j}\left(\sigma_0^{2j}-\sigma_1^{2j}\right), \tag{30}$$



where $\sigma_{0/1}^{2j}$ are the average states (2) for a number of $2j$ pure qubits. Here $\langle\hat J_z\rangle_j$ is the expectation value restricted to $S_j$ of the $z$-component of the angular momentum in the state $\rho^{\otimes n}$, where $\rho$ has Bloch vector $r\,z$. Eq. (30) is an exact algebraic identity that holds for any value of $j$, $n$ and $r$ (it bears no relation whatsoever to measurements of any kind). The second result is that for large $n$, both $p_{j_A}^n$ and $p_{j_C}^n$ become continuous probability distributions, $p_n(x_A)$ and $p_n(x_C)$, where $x_{A/C}=2j_{A/C}/n\in[0,1]$. Asymptotically, they approach Dirac delta functions peaked at $x_A=x_C=r$ (see Sec. Methods). Hence, the only relevant contribution to $\Delta^{\rm LM}$ comes from $\xi=\{rn/2,rn/2\}$. It then follows that in the asymptotic limit



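$$\sigma_{0,\xi}^n-\sigma_{1,\xi}^n\simeq\frac{2\langle\hat J_z\rangle_{rn/2}}{n}\left(\sigma_0^{rn}-\sigma_1^{rn}\right),\qquad\xi=\{rn/2,rn/2\}. \tag{31}$$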



Hence, mixed-state quantum classification using a TS of size $2n$ is equivalent to its pure-state version for a TS of size $2nr$, provided $n$ is asymptotically large. In particular, our proof of optimality above also holds for arbitrary $r\in(0,1]$ if the TS is sizable enough, and $R^{\rm LM}\simeq R^{\rm opt}$. This result is much stronger than robustness against decoherence, which would only require optimality for values of $r$ close to unity.
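The concentration of the block probabilities around $2j/n=r$ can be illustrated numerically. The sketch below is not taken from Sec. Methods: it assumes the standard multiplicity of the spin-$j$ representation in $n$ qubits, $\nu_j^n=\binom{n}{n/2-j}-\binom{n}{n/2-j-1}$, and uses the eigenvalues $(1\pm r)/2$ of the depolarized qubit state. Function names are hypothetical.

```python
import math


def j_distribution(n, r):
    """Return {2j: p_j}, the weight of rho^{(x)n} on each total angular momentum j."""
    lam_plus, lam_minus = (1 + r) / 2, (1 - r) / 2
    dist = {}
    for k in range(n // 2 + 1):                  # k = n/2 - j
        mult = math.comb(n, k) - (math.comb(n, k - 1) if k > 0 else 0)
        # Trace of rho^{(x)n} over one copy of the spin-j block (m = -j, ..., j).
        block = sum(lam_plus ** a * lam_minus ** (n - a) for a in range(k, n - k + 1))
        dist[n - 2 * k] = mult * block
    return dist


def mean_x(n, r):
    """Mean of x = 2j/n under the distribution p_j."""
    return sum(two_j / n * p for two_j, p in j_distribution(n, r).items())


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, mean_x(n, 0.6))
```

For purity $r=0.6$ the mean of $x$ drifts toward 0.6 as $n$ grows, consistent with the statement that a mixed TS of size $2n$ behaves like a pure TS of size $2nr$.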



From Eqs. (29) and (31) one can easily compute $\Delta^{\rm LM}$ for arbitrary $r$ using that [18] $\langle\hat J_z\rangle_j\simeq j-(1-r)/(2r)$ up to exponentially vanishing terms. The trace norm of $\sigma_0^n-\sigma_1^n$ can be retrieved from, e.g., Eq. (3). For $rn$ pure qubits one has $\|\sigma_0^{rn}-\sigma_1^{rn}\|_1\simeq(4/3)[1-1/(rn)]$. After some trivial algebra we obtain

$$P_e^{\rm LM}\simeq\frac12-\frac r3+\frac{1}{3rn}+o(n^{-1}) \tag{32}$$



for the error probability, in agreement with the optimal programmable machine value given in [11], as claimed above. This corresponds to an excess risk of

$$R^{\rm LM}\simeq\frac{1}{3rn}. \tag{33}$$



In the non-asymptotic case, the sum in Eq. (29) is not restricted to $\xi=\{j,j\}$ and the calculation of the excess risk becomes very involved. Rather than attempting to obtain an analytical result, for small training samples we have resorted to a numerical optimization. We first note that Eqs. (7) through (11) define a semidefinite programming optimization problem (SDP), for which very efficient numerical algorithms have been developed [17]. In this framework, one maximizes the objective function $\Delta^{\rm LM}$ [second equality in Eq. (9)] of the SDP variables $\Omega_m\ge0$, subject to the linear condition (8). We use this approach to compute the error probability, or equivalently, the excess risk of a LM for mixed-state quantum classification of small samples ($n\le5$), where no analytical expression of the optimal seed is known. For mixed states the expression of $\Gamma_\uparrow$ and $\Omega_m$ can be found in Sec. Methods, Eqs. (40) through (42).

Our results are shown in Fig. 1, where we plot $R^{\rm LM}$ (shaped dots) and the lower bounds given by $R^{\rm opt}$ (solid lines) as a function of the purity $r$ for up to $n=5$. We note that the excess risk of the optimal LM is always remarkably close to the absolute minimum provided by the optimal programmable machine and in the worst case ($n=2$) it is only 0.4% larger. For $n=1$ we see that $R^{\rm LM}=R^{\rm opt}$ for any value of $r$. This must be the case since for a single qubit in A and C one has $j_A=j_C=1/2$, and Eq. (30) holds.

We now turn to robustness against statistical fluctuations in the number of states of each type produced by the source. In a real scenario one has to expect that $j_A=n_A/2\ne n_C/2=j_C$, $n_A+n_C=2n$. Hence, $\Gamma_\uparrow$ has the general form (40), which gives us a hint that our choice $\Omega=\Omega_{m=0}$ may not be optimal for finite $n$. This has been confirmed by numerical analysis using the same SDP approach discussed above. Here, we show that the asymptotic performance (for large training samples) of the optimal LM, however, is still the same as that of the optimal programmable discriminator running under the same conditions (mixed states and statistical fluctuations in $n_{A/C}$).

FIG. 1. (Color online) Excess risk $R^{\rm LM}$ (points) and its corresponding lower bound $R^{\rm opt}$ (lines), both as a function of the purity $r$, and for values of $n$ ranging from 1 to 5 (from top to bottom).

Asymptotically, a real source for the problem at hand will typically produce $n_{A/C}=n\pm\delta\sqrt n$ mixed copies of each type. In Sec. Methods, it is shown that the relation (31) still holds in this case if $n$ is large. It reads

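$$\sigma_{0,\xi}^n-\sigma_{1,\xi}^n\simeq r\left(1-\frac{1-r}{nr^2}\right)\left(\sigma_0^{rn}-\sigma_1^{rn}\right) \tag{34}$$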

($\delta$ first appears at order $n^{-3/2}$). Hence, the effect of both statistical fluctuations in $n_{A/C}$ and noise (already considered above) is independent of the machine used for quantum classification (i.e., it is the same for LM, programmable machines, E&D, ...). In particular, the relation $R^{\rm LM}\simeq R^{\rm opt}$ between the excess risk of the optimal LM and its absolute limit given by the optimal programmable discriminator still holds asymptotically, which proves robustness.

To illustrate this, let us consider the effect of statistical
fluctuations in $n_{A/C}$ for pure states. The optimal
programmable machine for arbitrary $n_A$, $n_B$ and $n_C$ was
presented in earlier work (cited in Sec. Methods). The error
probability for the case at hand ($n_B = 1$) is given in Sec.
Methods. From its asymptotic expansion when $n_A$ and $n_C$
are both large one readily has

$$R^{\rm opt} = \frac{1}{6}\left(\frac{1}{n_A} + \frac{1}{n_C}\right) + \dots \qquad (35)$$

We see that when $n_{A/C} = n \pm \delta\sqrt{n}$ (i.e., when statistical
fluctuations in $n_{A/C}$ are taken into account) one still
has $R^{\rm opt} \simeq 1/(3n) \simeq R^{\rm LM}$.
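
As a quick numerical illustration of this last statement, the following minimal sketch evaluates the leading term of Eq. (35) for $n_{A/C} = n \pm \delta\sqrt{n}$ and compares it with $1/(3n)$. The function names and the sample values of $n$ and $\delta$ are illustrative assumptions, not part of the original analysis.

```python
from math import sqrt


def excess_risk_opt(n_a, n_c):
    """Leading term of Eq. (35): (1/6) * (1/n_A + 1/n_C)."""
    return (1.0 / n_a + 1.0 / n_c) / 6.0


def relative_shift(n, delta):
    """Relative deviation of Eq. (35) from 1/(3n) when n_{A/C} = n +/- delta*sqrt(n)."""
    fluct = delta * sqrt(n)
    return excess_risk_opt(n + fluct, n - fluct) * 3.0 * n - 1.0


if __name__ == "__main__":
    # The deviation equals delta^2 / (n - delta^2), so it vanishes as 1/n.
    for n in (10**2, 10**4, 10**6):
        print(n, relative_shift(n, delta=1.0))
```

Since the relative deviation scales as $\delta^2/n$, the fluctuations only affect $R^{\rm opt}$ beyond the leading order $1/(3n)$.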

DISCUSSION 

We have presented a supervised quantum learning machine
that classifies a single qubit prepared in a pure but
otherwise unknown state after it has been trained with
a number of already classified qubits. Its performance
attains the absolute bound given by the optimal programmable
discrimination machine. This learning machine does not
require quantum memory and can also be reused without
retraining, which may save a lot of resources. The machine
has been shown to be robust against noise and statistical
fluctuations in the number of states of each type produced
by the source. For small training sets the machine is very
close to optimal, attaining an excess risk that is larger
than the absolute lower limit by at most 0.4%. In the
absence of noise and statistical fluctuations, the machine
attains optimality for any size of the training set.

One may raise the question of whether or not the separated
measurements on the training set and data qubit
can be reversed in time; in a classical scenario where,
e.g., one has to identify one of two faces based on a
stack of training portraits, it is obvious that, without
memory limitations, the order of training and data
observation can be reversed (in both cases the final
decision is taken based on the very same information).
We will briefly show that this is not so in the quantum
world. In the reversed setting, the machine first
performs a measurement $\mathcal{D}$, with each element of rank
one, $u[\uparrow]u^\dagger$, and stores the information (which of the
possible outcomes is obtained) in the classical memory to
control the measurement to be performed on the training
set at a later time. The probability of error conditioned
to one of the outcomes, say $\uparrow$, is given by the
Helstrom formula $P_e^{\uparrow} = (1 - \|\Gamma_\uparrow\|_1/2)/2$, where $\Gamma_\uparrow$ is
defined above. Using Eq. (11) one has
$\|\Gamma_\uparrow\|_1 = (d_n^2\, d_{n+1})^{-1} \sum_{m,m'} |m - m'| = n/[3(n+1)]$.
The averaged error probability is then

$$P_e = \frac{1}{2}\left(1 - \frac{n}{6(n+1)}\right). \qquad (36)$$

In the limit of infinite copies this tends to 5/12,
which is far larger than the value $P_e \simeq 1/6$ attained by the
optimal learning machine. The same minimum error probability
of Eq. (36) can be attained by performing a Stern-Gerlach
measurement on the data qubit, which requires just one bit
of classical memory. This is all the classical information
that we can hope to retrieve from the data qubit, in
agreement with Holevo's bound [19]. This clearly limits the
possibilities of a correct classification, very much in the
same way as in face identification with limited memory size.
In contrast, the amount of classical information "sent
forward" in the optimal learning machine goes as the
logarithm of the size of the training sample. This asymmetry
also shows that despite the separability of the measurements,
non-classical correlations between the training set and the
data qubit play an important role in quantum learning.
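
The numbers quoted for the reversed protocol can be checked with exact rational arithmetic. The sketch below assumes that the trace norm is the double sum $\sum_{m,m'}|m-m'|/(d_n^2 d_{n+1})$ with $d_k = k+1$ (the dimension of the symmetric subspace of $k$ qubits), and that the conditional error probability is $(1-\|\Gamma_\uparrow\|_1/2)/2$; the function names are hypothetical.

```python
from fractions import Fraction


def trace_norm_gamma(n):
    """||Gamma_up||_1 = sum_{m,m'} |m - m'| / (d_n^2 d_{n+1}), with d_k = k + 1."""
    d_n, d_n1 = n + 1, n + 2
    # m and m' run over -n/2, ..., n/2; only differences matter, so use 0..n.
    total = sum(abs(a - b) for a in range(n + 1) for b in range(n + 1))
    return Fraction(total, d_n * d_n * d_n1)


def error_reversed(n):
    """Error probability (1 - ||Gamma_up||_1 / 2) / 2 of the reversed protocol."""
    return (1 - trace_norm_gamma(n) / 2) / 2


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, trace_norm_gamma(n), float(error_reversed(n)))
```

For growing $n$ the printed error probability approaches $5/12 \approx 0.4167$ from above, to be compared with $1/6 \approx 0.1667$ for the learning machine.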

Some relevant generalizations of this work to, e.g.,
higher dimensional systems and arbitrarily unbalanced
training sets, remain open problems. Another challenging
problem with direct practical applications in quantum
control and information processing is the extension
of this work to unsupervised machines, where no human
expert classifies the training sample.

METHODS



Covariance and structure of $\mathcal{L}$



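Block-diagonal form of $\rho^{\otimes n}$

The state $\rho^{\otimes n}$ of $n$ identical copies of a general qubit
state $\rho$ with purity $r$ and Bloch vector $r\mathbf{s}$ has a block-diagonal
form in the basis of the total angular momentum
given by

$$\rho^{\otimes n} = \sum_j p_j^n\, \rho_j \otimes \frac{\mathbb{1}_j}{\nu_j^n}.$$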

Here $j = 0\,(1/2), \dots, n/2$ if $n$ is even (odd), $\mathbb{1}_j$ is the
identity in the multiplicity space $\mathbb{C}^{\nu_j^n}$, of dimension $\nu_j^n$
(the multiplicity of the representation with total angular
momentum $j$), where

$$\nu_j^n = \binom{n}{n/2 - j}\,\frac{2j+1}{n/2 + j + 1}.$$

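The normalized state $\rho_j$, which is supported on the
representation subspace $S_j = {\rm span}\{|j,m\rangle\}$ of dimension
$2j+1 = d_{2j}$, is

$$\rho_j = U_{\mathbf{s}} \left( \sum_{m=-j}^{j} a_m^j\, [j,m] \right) U_{\mathbf{s}}^\dagger ,$$

where

$$a_m^j = \frac{1}{c_j} \left(\frac{1-r}{2}\right)^{j-m} \left(\frac{1+r}{2}\right)^{j+m} \qquad (37)$$

and

$$c_j = \frac{1}{r} \left\{ \left(\frac{1+r}{2}\right)^{2j+1} - \left(\frac{1-r}{2}\right)^{2j+1} \right\},$$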

so that $\sum_{m=-j}^{j} a_m^j = 1$, and we stick to our shorthand
notation $[\,\cdot\,] \equiv |\cdot\rangle\langle\cdot|$, i.e., $[j,m] \equiv |j,m\rangle\langle j,m|$. The
measurement on $\rho^{\otimes n}$ defined by the set of projectors on the
various subspaces $S_j$ will produce $\rho_j$ as a posterior state
with probability

$$p_j^n = \nu_j^n\, c_j \left(\frac{1-r^2}{4}\right)^{n/2-j}.$$

One can easily check that $\sum_j p_j^n = 1$.
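
The normalization can also be verified exactly for small $n$. The sketch below is an illustration only: it assumes the expressions $\nu_j^n = \binom{n}{n/2-j}(2j+1)/(n/2+j+1)$, $c_j = [((1+r)/2)^{2j+1} - ((1-r)/2)^{2j+1}]/r$ and $p_j^n = \nu_j^n c_j [(1-r^2)/4]^{n/2-j}$, uses rational purities, and labels the blocks by the integer $2j$; all function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def multiplicity(n, two_j):
    """nu_j^n = C(n, n/2 - j) (2j + 1) / (n/2 + j + 1), with two_j = 2j."""
    k = (n - two_j) // 2            # k = n/2 - j
    return Fraction(comb(n, k) * (two_j + 1), n - k + 1)


def c_norm(two_j, r):
    """c_j = [((1+r)/2)^(2j+1) - ((1-r)/2)^(2j+1)] / r."""
    q, p = (1 + r) / 2, (1 - r) / 2
    return (q ** (two_j + 1) - p ** (two_j + 1)) / r


def block_probability(n, two_j, r):
    """p_j^n = nu_j^n c_j ((1 - r^2)/4)^(n/2 - j)."""
    k = (n - two_j) // 2
    return multiplicity(n, two_j) * c_norm(two_j, r) * ((1 - r * r) / 4) ** k


def allowed_two_j(n):
    """2j takes the values n, n-2, ... down to 0 (n even) or 1 (n odd)."""
    return range(n % 2, n + 1, 2)


if __name__ == "__main__":
    r = Fraction(3, 5)              # purity, kept rational so the check is exact
    for n in range(1, 7):
        print(n, sum(block_probability(n, tj, r) for tj in allowed_two_j(n)))
```

The same helper functions also reproduce the dimension count $\sum_j \nu_j^n (2j+1) = 2^n$.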

In the large $n$ limit, we can replace $p_j^n$ by a continuous
probability distribution $p_n(x)$ in $[0,1]$, where $x = 2j/n$.
Applying Stirling's approximation to $p_j^n$ one obtains:

$$p_n(x) \simeq \sqrt{\frac{n}{2\pi}}\; \frac{1}{\sqrt{1-x^2}}\; \frac{x(1+r)}{r(1+x)}\; e^{-n H\left(\frac{1+x}{2} \,\|\, \frac{1+r}{2}\right)},$$

where $H(s \,\|\, t)$ is the (binary) relative entropy

$$H(s \,\|\, t) = s \log\frac{s}{t} + (1-s) \log\frac{1-s}{1-t}.$$

The approximation is valid for $x$ and $r$ both in the open
unit interval (0,1). For non-vanishing $r$, $p_n(x)$ becomes
a Dirac delta function peaked at $x = r$, $p_\infty(x) = \delta(x - r)$,
which corresponds to $j = nr/2$.
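
The concentration of $p_j^n$ around $j = nr/2$ can be illustrated numerically. The following sketch assumes the exact expression $p_j^n = \nu_j^n c_j [(1-r^2)/4]^{n/2-j}$ and the Stirling form $p_n(x) \simeq \sqrt{n/2\pi}\,(1-x^2)^{-1/2}\,x(1+r)/[r(1+x)]\,e^{-nH}$ with natural logarithms; it works with logarithms to avoid overflow, and every function name is a hypothetical choice.

```python
from math import lgamma, log, log1p, exp, sqrt, pi


def log_block_probability(n, two_j, r):
    """log of the exact p_j^n = nu_j^n c_j ((1-r^2)/4)^(n/2-j), with two_j = 2j."""
    k = (n - two_j) // 2
    q, p = (1 + r) / 2, (1 - r) / 2
    log_nu = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
              + log(two_j + 1) - log(n - k + 1))
    log_c = (two_j + 1) * log(q) + log1p(-(p / q) ** (two_j + 1)) - log(r)
    return log_nu + log_c + k * log(p * q)


def rel_entropy(s, t):
    """Binary relative entropy H(s||t) with natural logarithms."""
    return s * log(s / t) + (1 - s) * log((1 - s) / (1 - t))


def p_continuous(n, x, r):
    """Stirling form of p_n(x), valid for 0 < x < 1 and 0 < r < 1."""
    pref = sqrt(n / (2 * pi)) * x * (1 + r) / (sqrt(1 - x * x) * r * (1 + x))
    return pref * exp(-n * rel_entropy((1 + x) / 2, (1 + r) / 2))


def peak_location(n, r):
    """Value of x = 2j/n at which the exact p_j^n is largest."""
    best = max(range(n % 2, n + 1, 2),
               key=lambda tj: log_block_probability(n, tj, r))
    return best / n


if __name__ == "__main__":
    n, r = 400, 0.6
    exact_density = (n / 2) * exp(log_block_probability(n, 240, r))  # dj = (n/2) dx
    print(peak_location(n, r), exact_density, p_continuous(n, 0.6, r))
```

The factor $n/2$ converts the discrete probability into a density in $x$, since consecutive values of $j$ differ by one, i.e., $x$ changes by $2/n$.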



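We start with a POVM element of the form $E_0 = \sum_\mu L_\mu \otimes D_\mu$.
Since $D_\mu$ must be a rank-one projector, it
can always be written as $D_\mu = u_\mu [\uparrow] u_\mu^\dagger$ for a suitable
SU(2) rotation $u_\mu$. Thus,

$$E_0 = \sum_\mu \int du\, \left(U_{AC} L_\mu U_{AC}^\dagger\right) \otimes \left(u\, u_\mu [\uparrow] u_\mu^\dagger u^\dagger\right).$$

We next use the invariance of the Haar measure $du$ to
make the change of variable $u\, u_\mu \to u'$ and, accordingly,
$U_{AC} \to U'_{AC} U_{\mu AC}^\dagger$. After regrouping terms we have

$$E_0 = \sum_\mu \int du'\, \left(U'_{AC} U_{\mu AC}^\dagger L_\mu U_{\mu AC} U'^\dagger_{AC}\right) \otimes \left(u' [\uparrow] u'^\dagger\right)$$
$$\phantom{E_0} = \int du'\, U'_{AC} \left(\sum_\mu U_{\mu AC}^\dagger L_\mu U_{\mu AC}\right) U'^\dagger_{AC} \otimes \left(u' [\uparrow] u'^\dagger\right)$$
$$\phantom{E_0} = \int du\, \left(U_{AC}\, \Omega\, U_{AC}^\dagger\right) \otimes \left(u [\uparrow] u^\dagger\right),$$

where we have defined $\Omega = \sum_\mu U_{\mu AC}^\dagger L_\mu U_{\mu AC} \ge 0$.
The POVM element $E_1$ is obtained by replacing
$[\uparrow]$ by $[\downarrow]$ in the expressions above. From the
POVM condition $\sum_\mu L_\mu = \mathbb{1}_{AC}$ it immediately follows
that $\int du\, U_{AC}\, \Omega\, U_{AC}^\dagger = \mathbb{1}_{AC}$, where $\mathbb{1}_{AC}$ is the identity
on the Hilbert space of the TS, i.e., $\mathbb{1}_{AC} = \mathbb{1}_A \otimes \mathbb{1}_C$.
Therefore $\mathcal{L} = \{U_{AC}\, \Omega\, U_{AC}^\dagger\}_{SU(2)}$ is a covariant POVM.
The positive operator $\Omega$ is called the seed of the covariant
POVM $\mathcal{L}$.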

Now, let $u_z(\varphi)$ be a rotation about the $z$-axis, which
leaves $[\uparrow]$ invariant. By performing the change of variables
$u \to u' u_z(\varphi)$ [and $U_{AC} \to U'_{AC} U_{z\,AC}(\varphi)$] in
the last equation above, we readily see that $\Omega$ and
$U_{z\,AC}(\varphi)\, \Omega\, U_{z\,AC}^\dagger(\varphi)$ both give the same average
operator $E_0$ for any $\varphi \in [0, 4\pi)$. So, its average over $\varphi$,

$$\int_0^{4\pi} \frac{d\varphi}{4\pi}\, U_z(\varphi)\, \Omega\, U_z^\dagger(\varphi),$$

can be used as a seed w.l.o.g., where we have dropped
the subscript AC to simplify the notation. Such a seed
is by construction invariant under the group of rotations
about the $z$-axis (just like $[\uparrow]$) and, by Schur's lemma, a
direct sum of operators with well defined magnetic number.
Therefore, in the total angular momentum basis
for AC, we can always choose the seed of $\mathcal{L}$ as



The constraint on the seed follows from the POVM condition
$\mathbb{1}_{AC} = \int du\, U\, \Omega\, U^\dagger$ and Schur's lemma. The result
also holds if A and C have different numbers of copies
(provided they add up to $2n$). It also holds for mixed
states.

Wigner's 6j-symbols

Let us consider three angular momenta $j_1$, $j_2$, $j_3$ that
couple to give a total $J$. Note that there is no unique
way to carry out this coupling; we might first couple $j_1$
and $j_2$ to give a resultant $j_{12}$, and couple this to $j_3$ to
give $J$, or alternatively, we may couple $j_1$ to the resultant
$j_{23}$ of coupling $j_2$ and $j_3$. Moreover, the intermediate
couplings can give in principle different values of $j_{12}$
or $j_{23}$ which, when coupled to $j_3$ or $j_1$, end up giving
the same value of $J$. All these possibilities lead to linearly
independent states with the same $J$ and $M$, thus
they must be distinguished by specifying the intermediate
angular momentum and the order of coupling. There
exists a unitary transformation that maps the states obtained
from the two possible orderings of the coupling;
Wigner's 6j-symbols [16], denoted in the next equation
by $\{\cdots\}$, provide the coefficients of this transformation:

$$\langle (j_1 j_2) j_{12}, j_3; J, M \,|\, j_1, (j_2 j_3) j_{23}; J, M \rangle = (-1)^{j_1 + j_2 + j_3 + J} \sqrt{(2j_{12}+1)(2j_{23}+1)} \begin{Bmatrix} j_1 & j_2 & j_{12} \\ j_3 & J & j_{23} \end{Bmatrix}.$$


Note that this overlap is independent of $M$. For the
proof of optimality of the LM, we couple subsystems A, B
and C in two ways: A(CB) and (AC)B to produce the
states $|j_A, (j_C j_B) j_{CB}; J, M\rangle$ and $|(j_A j_C) j_{AC}, j_B; J, M\rangle$,
which we denote by $|J,M\rangle_{A(CB)}$ and $|J,M\rangle_{(AC)B}$ respectively
for short. The various angular momenta involved are
fixed to $j_A = j_C = n/2$, $j_B = 1/2$, $j_{AC} = j$, $j_{CB} = n/2 + 1/2$,
whereas $J = j \pm 1/2$. With these values, the overlaps we
need are given by

$${}_{A(CB)}\langle j \pm \tfrac{1}{2}, M \,|\, j \pm \tfrac{1}{2}, M \rangle_{(AC)B} = \sqrt{\frac{n + 3/2 \pm (j + 1/2)}{2(n+1)}}.$$



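Derivation of Eqs. (30) and (34)

Let us start with the general case where $\xi = \{j, j'\}$.
To obtain $\sigma^n_{0,\xi}$ we first write Eqs. (21) and (25) as the
SU(2) group integrals

$$\sigma^n_{0,\xi} = \int du\, U_{AB} \left( \sum_{m=-j}^{j} a_m^j\, [j,m]_A \otimes \rho_0^B \right) U_{AB}^\dagger \otimes \int du'\, U'_C \left( \sum_{m=-j'}^{j'} a_m^{j'}\, [j',m]_C \right) U'^\dagger_C ,$$

where $a_m^j$ is given in Eq. (37), and $\rho_0^B$ is the mixed state $\rho_0$,
Eq. (26), of the qubit B. We next couple A with B (more
precisely, their subspaces of angular momentum $j$) using
the Clebsch-Gordan coefficients

$$|\langle j + \tfrac{1}{2}, m + \tfrac{1}{2} \,|\, j, m; \tfrac{1}{2}, \tfrac{1}{2} \rangle|^2 = \frac{j + m + 1}{2j + 1}, \qquad |\langle j - \tfrac{1}{2}, m + \tfrac{1}{2} \,|\, j, m; \tfrac{1}{2}, \tfrac{1}{2} \rangle|^2 = \frac{j - m}{2j + 1}.$$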



The resulting expressions can be easily integrated using
Schur's lemma. Note that the integrals of crossed terms of
the form $|j,m\rangle\langle j',m|$ will vanish for all $j \neq j'$. We readily
obtain

$$\sigma^n_{0,\xi} = \sum_{m=-j}^{j} a_m^j \left( \frac{j + 1 + mr}{d_{2j}}\, \frac{\mathbb{1}^{AB}_{2j+1}}{d_{2j+1}} + \frac{j - mr}{d_{2j}}\, \frac{\mathbb{1}^{AB}_{2j-1}}{d_{2j-1}} \right) \otimes \frac{\mathbb{1}^{C}_{2j'}}{d_{2j'}},$$

where $\mathbb{1}_{2j}$ is the projector on $S_j$ and $d_{2j} = 2j + 1 =
\dim S_j$. The superscripts attached to the various projectors
specify the subsystems to which they refer. These
projectors are formally equal to those used in Eq. (2)
(i.e., $\mathbb{1}_{2j}$ projects on the fully symmetric subspace of $2j$
qubits) and, hence, we stick to the same notation. Note
that ${\rm tr}\, \sigma^n_{0,\xi} = 1$, as it should be.

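We can further simplify this expression by introducing
$\langle J_z \rangle_j = \sum_m m\, a_m^j$, i.e., the expectation value of the
$z$-component of the total angular momentum in the state $\rho_j$
(i.e., of $\mathbb{1}_{2j} J_z \mathbb{1}_{2j}$ in the state $\rho_j$) for a Bloch vector $r\hat{z}$:

$$\sigma^n_{0,\xi} = \left( \frac{j + 1 + r\langle J_z \rangle_j}{d_{2j}}\, \frac{\mathbb{1}^{AB}_{2j+1}}{d_{2j+1}} + \frac{j - r\langle J_z \rangle_j}{d_{2j}}\, \frac{\mathbb{1}^{AB}_{2j-1}}{d_{2j-1}} \right) \otimes \frac{\mathbb{1}^{C}_{2j'}}{d_{2j'}}.$$

Using the relation

$$\mathbb{1}^{AB}_{2j-1} = \mathbb{1}^{A}_{2j} \otimes \mathbb{1}^{B}_{1} - \mathbb{1}^{AB}_{2j+1},$$

and $(j+1)/d_{2j+1} = j/d_{2j-1} = 1/2$, we can write

$$\sigma^n_{0,\xi} = \left( \frac{r\langle J_z \rangle_j}{j}\, \frac{\mathbb{1}^{AB}_{2j+1}}{d_{2j+1}} + \frac{j - r\langle J_z \rangle_j}{j}\, \frac{\mathbb{1}^{A}_{2j}}{d_{2j}} \otimes \frac{\mathbb{1}^{B}_{1}}{2} \right) \otimes \frac{\mathbb{1}^{C}_{2j'}}{d_{2j'}}. \qquad (38)$$

Similarly, we can show that

$$\sigma^n_{1,\xi} = \frac{\mathbb{1}^{A}_{2j}}{d_{2j}} \otimes \left( \frac{r\langle J_z \rangle_{j'}}{j'}\, \frac{\mathbb{1}^{BC}_{2j'+1}}{d_{2j'+1}} + \frac{j' - r\langle J_z \rangle_{j'}}{j'}\, \frac{\mathbb{1}^{B}_{1}}{2} \otimes \frac{\mathbb{1}^{C}_{2j'}}{d_{2j'}} \right). \qquad (39)$$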
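
The algebra leading to Eq. (38) can be checked with exact rational arithmetic. The sketch below assumes that, before the projector relation is used, the weights of $\mathbb{1}^{AB}_{2j\pm1}$ are $(j+1+r\langle J_z\rangle_j)/(d_{2j}d_{2j+1})$ and $(j-r\langle J_z\rangle_j)/(d_{2j}d_{2j-1})$, and that in Eq. (38) the weights of $\mathbb{1}^{AB}_{2j+1}$ and $\mathbb{1}^A_{2j}\otimes\mathbb{1}^B_1$ are $r\langle J_z\rangle_j/(j\,d_{2j+1})$ and $(j-r\langle J_z\rangle_j)/(2j\,d_{2j})$. The function names are hypothetical, and `jz` stands for an arbitrary value of $\langle J_z\rangle_j$.

```python
from fractions import Fraction


def coefficients_before(j, r, jz):
    """Weights of 1_{2j+1}^{AB} and 1_{2j-1}^{AB} before using the projector relation."""
    d2j, d_plus, d_minus = 2 * j + 1, 2 * j + 2, 2 * j
    return (j + 1 + r * jz) / (d2j * d_plus), (j - r * jz) / (d2j * d_minus)


def coefficients_eq38(j, r, jz):
    """Weights of 1_{2j+1}^{AB} and of 1_{2j}^A (x) 1_1^B as assumed for Eq. (38)."""
    d2j, d_plus = 2 * j + 1, 2 * j + 2
    return r * jz / (j * d_plus), (j - r * jz) / (j * d2j * 2)


def consistent(j, r, jz):
    """Substituting 1_{2j-1} = 1_{2j} (x) 1_1 - 1_{2j+1} must reproduce Eq. (38)."""
    w_plus, w_minus = coefficients_before(j, r, jz)
    c_sym, c_prod = coefficients_eq38(j, r, jz)
    return w_plus - w_minus == c_sym and w_minus == c_prod


if __name__ == "__main__":
    print(consistent(Fraction(3, 2), Fraction(1, 3), Fraction(1, 2)))
```

Because the identity is purely algebraic, it holds for any rational $j > 0$, $r$ and `jz`, not only for physically allowed values.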



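Therefore, if $j' = j$,

$$\sigma^n_{0,\xi} - \sigma^n_{1,\xi} = \frac{r\langle J_z \rangle_j}{j} \left( \frac{\mathbb{1}^{AB}_{2j+1}}{d_{2j+1}} \otimes \frac{\mathbb{1}^{C}_{2j}}{d_{2j}} - \frac{\mathbb{1}^{A}_{2j}}{d_{2j}} \otimes \frac{\mathbb{1}^{BC}_{2j+1}}{d_{2j+1}} \right).$$

Comparing with the pure-state case, the two terms in parentheses
can be understood as the average states for a number of
$2j$ pure qubits, i.e., as $\sigma^{2j}_0$ and $\sigma^{2j}_1$ respectively. Hence,
if $\xi = \{j, j\}$ we have the relation

$$\sigma^n_{0,\xi} - \sigma^n_{1,\xi} = \frac{r\langle J_z \rangle_j}{j} \left( \sigma^{2j}_0 - \sigma^{2j}_1 \right),$$

which is Eq. (30). It is important to emphasize that this
equation is exact (i.e., it holds for any value of $j$, $n$ and $r$)
and bears no relation whatsoever to measurements (i.e.,
it is an algebraic identity between the various operators
involved).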

In the asymptotic limit, for $n_A$ and $n_C$ of the form
$n_{A/C} \simeq n \pm b n^a$, $n \gg 1$, $a < 1$, the probabilities
$p_j^{n_A}$ and $p_{j'}^{n_C}$ are peaked at $j \simeq r n_A/2$ and $j' \simeq r n_C/2$,
as was explained above. Hence, only the average
state components $\sigma^n_{0/1,\xi}$ with $\xi = \{j, j'\}$ such that
$j \simeq (r/2) n (1 + b n^{a-1})$ and $j' \simeq (r/2) n (1 - b n^{a-1})$ are
important. From Eqs. (38) and (39) it is straightforward
to obtain

where we have used that [11] $\langle J_z \rangle_j \simeq j - (1-r)/(2r)$
up to exponentially vanishing terms. This relation, for
the particular value of $a = 1/2$, is used in the proof of
robustness, Eq. (34).
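
The quality of the approximation $\langle J_z\rangle_j \simeq j - (1-r)/(2r)$ can be illustrated with a short exact computation. The sketch assumes $a_m^j \propto [(1-r)/2]^{j-m}[(1+r)/2]^{j+m}$ and rational purities; the function names are hypothetical.

```python
from fractions import Fraction


def mean_jz(two_j, r):
    """<J_z>_j = sum_m m a_m^j, with a_m^j ~ ((1-r)/2)^(j-m) ((1+r)/2)^(j+m)."""
    q, p = (1 + r) / 2, (1 - r) / 2
    # k = j + m runs over 0..2j, so that m = k - j = (2k - 2j)/2.
    weights = [p ** (two_j - k) * q ** k for k in range(two_j + 1)]
    norm = sum(weights)
    return sum(Fraction(2 * k - two_j, 2) * w for k, w in enumerate(weights)) / norm


def mean_jz_asymptotic(two_j, r):
    """Large-j estimate j - (1 - r)/(2 r)."""
    return Fraction(two_j, 2) - (1 - r) / (2 * r)


if __name__ == "__main__":
    r = Fraction(1, 2)
    for two_j in (2, 10, 40):
        gap = mean_jz(two_j, r) - mean_jz_asymptotic(two_j, r)
        print(two_j, float(gap))
```

The printed gap decays geometrically in $2j$, with ratio $(1-r)/(1+r)$, which is what "exponentially vanishing" refers to.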



In order to minimize the excess risk using SDP, we find 
it convenient to write Eq. Q in the form 

A LM = 2 max VVtr (T u n m ^), (41) 

where we recall that $m = m_{AC} = m_A + m_C$, and we
assumed w.l.o.g. that the seed of the optimal POVM
has the block form $\Omega_m = \bigoplus_\xi \Omega_{m,\xi}$. The POVM
condition, Eq. (8), must now hold on each block; thus
for $\xi = \{j_A, j_C\}$, we must impose that

$$\sum_{m=-j}^{j} \langle j, m |\, \Omega_{m,\xi}\, | j, m \rangle = 2j + 1, \qquad |j_A - j_C| \le j \le j_A + j_C. \qquad (42)$$



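Calculation of $\Gamma_{\uparrow,\xi}$

Here we calculate $\Gamma_{\uparrow,\xi} = {\rm tr}_B\{[\uparrow](\sigma^n_{0,\xi} - \sigma^n_{1,\xi})\}$, where
the average states are defined in Eqs. (27) and (28), and
explicitly given in Eqs. (38) and (39) for $\xi = \{j, j'\}$. Let
us first calculate the conditional state ${\rm tr}_B([\uparrow]\sigma^n_{0,\xi})$. For
that, we need to express $\mathbb{1}^{AB}_{2j+1} = \sum_m [j + \tfrac{1}{2}, m]$ in the
original product basis $\{|j_A, m_A\rangle \otimes |\tfrac{1}{2}, m_B\rangle\}$. Recalling the
Clebsch-Gordan coefficients
$|\langle \tfrac{1}{2}, \tfrac{1}{2}; j, m \,|\, j + \tfrac{1}{2}, m + \tfrac{1}{2} \rangle|^2 = (j + m + 1)/(2j + 1)$,
one readily obtains

$${\rm tr}_B\left( [\uparrow]\, \frac{\mathbb{1}^{AB}_{2j+1}}{d_{2j+1}} \right) = \sum_{m=-j}^{j} \frac{j + 1 + m}{2(j+1)\, d_{2j}}\, [j, m],$$

which can be written as

$${\rm tr}_B\left( [\uparrow]\, \frac{\mathbb{1}^{AB}_{2j+1}}{d_{2j+1}} \right) = \frac{1}{2 d_{2j}} \left( \mathbb{1}^{A}_{2j} + \frac{J_z^A}{j + 1} \right),$$

where $J_z^A$ is the $z$ component of the total angular momentum
operator acting on subsystem A. An analogous
expression is obtained for ${\rm tr}_B([\uparrow]\mathbb{1}^{BC}_{2j'+1})$. Substituting
in Eqs. (38) and (39) and subtracting the resulting expressions,
one has $\Gamma_\uparrow = \sum_\xi p_\xi\, \Gamma_{\uparrow,\xi}$ with

$$\Gamma_{\uparrow,\xi} = \frac{r}{2\, d_{2j_A} d_{2j_C}} \left( \frac{\langle J_z \rangle_{j_A}}{j_A}\, \frac{J_z^A}{j_A + 1} - \frac{\langle J_z \rangle_{j_C}}{j_C}\, \frac{J_z^C}{j_C + 1} \right), \qquad (40)$$

where we have written $\xi = \{j_A, j_C\}$, instead of $\xi = \{j, j'\}$
used in the derivation. For pure states, $r = 1$, $j_A = j_C =
n/2$, $\langle J_z \rangle_{n/2} = n/2$, and we recover Eq. (11).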
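
Since $\Gamma_{\uparrow,\xi}$ is diagonal in the product basis $|j_A, m_A\rangle \otimes |j_C, m_C\rangle$, its spectrum is easy to tabulate. The sketch below takes Eq. (40) in the form $\Gamma_{\uparrow,\xi} = \frac{r}{2 d_{2j_A} d_{2j_C}} \big( \frac{\langle J_z\rangle_{j_A} J_z^A}{j_A(j_A+1)} - \frac{\langle J_z\rangle_{j_C} J_z^C}{j_C(j_C+1)} \big)$ as an assumption, requires $j_A, j_C > 0$ and rational $r$, and uses hypothetical function names.

```python
from fractions import Fraction


def mean_jz(two_j, r):
    """<J_z>_j for weights a_m^j ~ ((1-r)/2)^(j-m) ((1+r)/2)^(j+m)."""
    q, p = (1 + r) / 2, (1 - r) / 2
    weights = [p ** (two_j - k) * q ** k for k in range(two_j + 1)]
    return sum(Fraction(2 * k - two_j, 2) * w
               for k, w in enumerate(weights)) / sum(weights)


def gamma_up_spectrum(two_ja, two_jc, r):
    """Eigenvalues of Gamma_{up,xi}, indexed by (m_A, m_C), from the assumed Eq. (40)."""
    ja, jc = Fraction(two_ja, 2), Fraction(two_jc, 2)
    pref = r / (2 * (two_ja + 1) * (two_jc + 1))
    wa = mean_jz(two_ja, r) / (ja * (ja + 1))
    wc = mean_jz(two_jc, r) / (jc * (jc + 1))
    spectrum = {}
    for ka in range(two_ja + 1):
        for kc in range(two_jc + 1):
            ma, mc = ka - ja, kc - jc
            spectrum[(ma, mc)] = pref * (wa * ma - wc * mc)
    return spectrum


if __name__ == "__main__":
    # Pure states with n = 3 copies in A and C: eigenvalues (m_A - m_C) / 80.
    spec = gamma_up_spectrum(3, 3, Fraction(1))
    print(sum(abs(v) for v in spec.values()))
```

For pure states the eigenvalues reduce to $(m_A - m_C)/[d_n^2 (n+2)]$, and the sum of their absolute values reproduces the trace norm $n/[3(n+1)]$ used in the Discussion.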



Programmable machine for unbalanced training sets 

The minimum error probability of the optimal programmable
machine with a number of $n_A$, 1 and $n_C$
copies ($n_A \ge n_C$) in ports A, B and C respectively, is [11]



t _l Do Dp 
4) Di D £»i 



k=0 



(n A ~n c + 2k+2) 



D D X (n A -n c +k+l)(k+l) 



1-4 



(A)+£>i) 2 {n A + l){n c + l) 



where $D_0 = (n_A + 2)(n_C + 1)$ and $D_1 = (n_A + 1)(n_C + 2)$
are the dimensions of the average states $\sigma_{0/1}$. The
asymptotic form of this expression when $n_A$ and $n_C$
are both very large can be easily derived using the
Euler-Maclaurin summation formula. The result up to
sub-leading order is



$$P_e^{\rm opt} \simeq \frac{1}{6} + \frac{1}{6}\left(\frac{1}{n_A} + \frac{1}{n_C}\right),$$

which leads to Eq. (35).